Abstract

Two prevailing theories for explaining social group or community struc- 
ture are cohesion and identity. The social cohesion approach posits that 
social groups arise out of an aggregation of individuals that have mu- 
tual interpersonal attraction as they share common characteristics. These 
characteristics can range from common interests to kinship ties and from 
social values to ethnic backgrounds. In contrast, the social identity ap- 
proach posits that an individual is likely to join a group based on an 
intrinsic self-evaluation at a cognitive or perceptual level. In other words 
group members typically share an awareness of a common category mem- 
bership. 

In this work we seek to understand the role of these two contrasting 
theories in explaining the behavior and stability of social communities 
in Twitter. A specific focal point of our work is to understand the role 
of these theories in disparate contexts ranging from disaster response to 
socio-political activism. We extract social identity and social cohesion 
features-of-interest for large scale datasets of five real-world events and 
examine the effectiveness of such features in capturing behavioral charac- 
teristics and the stability of groups. We also propose a novel measure of 
social group sustainability based on the divergence in group discussion. 
Our main findings are: 1) Sharing of social identities (especially physical 
location) among group members has a positive impact on group sustain- 
ability, 2) Structural cohesion (represented by high group density and low 
average shortest path length) is a strong indicator of group sustainability, 
and 3) Event characteristics play a role in shaping group sustainability, as 
social groups in transient events behave differently from groups in events 
that last longer. 

* Joint first authors. 

† Kno.e.sis Center, Wright State University, {hemant,amit}@knoesis.org
‡ Department of Computer Science and Engineering, The Ohio State University,
{ruan,fuhry,srini}@cse.ohio-state.edu






1 Introduction 



Online social networks allow Internet users all over the globe to share
information, exchange thoughts, and work collaboratively. All of these
activities involve more than a single user, which makes the dynamics of online
social groups worthy of study. In particular, what factors influence an online
social group's formation, growth, and sustainability?

The prevalence of online social networks in the last decade has enabled
computer scientists to answer questions of group sustainability and evaluate
their solutions with large-scale experiments [21 [25] [7] [20]. Despite the
research progress made to date on community structure and group dynamics,
there are at least three open questions to be answered:

• What is the relation between the findings of past research on group sus- 
tainability using structural characteristics and socio-psychological theories 
of group dynamics? 

• How can existing theories on social group behavior guide us in identifying 
relevant features to model online social group sustainability? 

• An online social group's sustainability depends not only on group size, but
also on the divergence of its discussion content. How do we quantify this
notion, and which social group characteristics pertain to it?

Over decades of study, social psychologists have proposed diverse expla- 
nations about the dynamics of a social group and its behavior. Two main 
frameworks among others are the social identity approach and social cohesion 
approach. 

Social Identity Approach: The social identity approach includes two closely
related theories: social identity theory [26] and self-categorization theory
[28]. In [26], Tajfel defines the concept of social identity as "the
individual's knowledge that he belongs to certain social groups together with
some emotional and value significance to him of this group membership".
Therefore, group membership is the result of "shared self-identification"
rather than "cohesive interpersonal relationship", and such shared identity
leads to cohesiveness and uniformity, among other features [37]. One commonly
cited example of the social identity approach is team sports, where teammates
represent the same organization (a school, a club, or a country) and are well
aware of the need to sustain the reputation of their associated identity. We
refer to this approach as "(social) identity" in the following sections.

Social Cohesion Approach: The social cohesion approach views social groups
from a different perspective. Its hypothesis is that the necessary and
sufficient condition for individuals to work as a group is the presence of
cohesive social relationships between them. While social relationships exist
for different reasons (e.g., kinship ties or similar social values), we focus
on a group's structural cohesion, the collective result of those social
relationships. Here, we adopt the definition by Lott and Lott [11] that
interprets cohesiveness as mutual attraction between individuals, which is
slightly different from that used in [5]. In accordance with this definition,
a positive correlation between group cohesion and group performance has been
reported for various types of groups [TH 13"]. We denote this structural
cohesion approach as "(social) cohesion" from now on.

As noted above, social identity and social cohesion attribute group formation
and sustainability to different factors. The identity approach posits that a
social group is the result of its members' collective awareness of some type
of category membership. In contrast, the cohesion approach conjectures that
mutual attraction between pairs of individuals makes them a group, implying
that the structural cohesion of member connections determines the
sustainability of a social group.

To study the sustainability of social groups, a multitude of predictive models
have been established to answer the question "How many users will a social
group have in the future?". While group size and growth rate are intuitive
measures, using them alone overlooks other important aspects of a social
group's sustainability. One drawback, for example, is that they do not capture
the stability of group membership. Imagine a group that had five members, of
whom four later left while nine new members joined. Although the group doubles
in size, the low retention rate will have a negative impact on its long-term
sustainability. Also, previous studies have not inspected the divergence of
content generated in social groups. If individual group members produce
content on vastly different topics, it is harder for the community's voice to
be heard. Content coherence is especially critical for online discussion
groups founded with a dedicated purpose (e.g., political rally [5], disaster
relief [21]), and it is not captured by group size at all. Given the
limitations of the simple measure of group size, alternative definitions of
group sustainability are needed.
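The five-member example above can be made concrete with a small sketch. The
function name and the definition of retention rate used here (the fraction of
previous members still present) are our own illustrative assumptions, not
measures defined in this paper.

```python
def size_and_retention(previous_members, current_members):
    """Return (growth ratio, retention rate) between two membership snapshots.

    Retention rate is assumed here to be the fraction of previous members
    who are still in the group; this is an illustrative definition only.
    """
    previous = set(previous_members)
    current = set(current_members)
    growth_ratio = len(current) / len(previous)
    retention_rate = len(previous & current) / len(previous)
    return growth_ratio, retention_rate


# Five original members; four leave and nine new members join.
before = {"u1", "u2", "u3", "u4", "u5"}
after = {"u1"} | {f"n{i}" for i in range(1, 10)}
growth, retention = size_and_retention(before, after)
print(growth, retention)  # 2.0 0.2
```

The group doubles in size while retaining only one fifth of its original
members, which a size-based measure alone would not reveal.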

Main Results and Contribution: In this study, taking Twitter as our
experimental platform, we quantify theoretical notions from the social
cohesion and social identity approaches of social science in a way that
accommodates the characteristics of online social networks. Social identity is
computationally modeled via features of self-presentation in user profiles,
which can also encompass users' physical-world identities. We represent social
cohesion by structural features of the group's static friendship/follower
network. These features incorporate the guidance of the two theoretical
approaches to capture users' social behavior in both the physical and online
worlds, and therefore help us better understand the role of these theories in
group behavior.

Furthermore, we propose a novel measure of social group sustainability, topic 
divergence, based on the divergence of each individual member's discussion from 
the group's main line of discussion. Our two main hypotheses regarding group 
sustainability are: 

Hypothesis 1.1 The more structurally cohesive a social group is, the lower 
topic divergence the group has. 

Hypothesis 1.2 The more similar in identities a social group's members are, 
the lower topic divergence the group has. 

From experiments on five real-world datasets, we observe that 1) Sharing of
social identities (especially regional identity) among group members has a
positive impact on group sustainability, 2) Structural cohesion (represented
by high group density and low average shortest path length) is a strong
indicator of group sustainability, 3) At least on Twitter, features based on
the assumption of uni-directional interpersonal attraction have statistically
equal explanatory power to features based on the assumption of mutual
attraction, and 4) Event characteristics affect online social group
sustainability. Notably, during transient events like disasters, structurally
cohesive social groups are less likely to exist; therefore, the social
identity of users can be utilized to create stable groups that help in the
relief efforts.

2 Related Work 

Social network analysis has received greater attention in the last decade as
online social networks have evolved faster than ever. Most studies took a
network- or structure-centric approach to modeling community dynamics,
aligning more closely with the social cohesion approach. We discuss here some
of the noteworthy studies covering different forms of group dynamics.

In an effort to understand the structure of social networks at large scale,
Mislove et al. [TO] presented a study of the Flickr, YouTube, LiveJournal, and
Orkut networks. Their results confirmed the power-law, small-world, and
scale-free properties of online social networks and showed that those networks
contain a densely connected core of high-degree nodes, and that this core
links small groups of strongly clustered, low-degree nodes at the fringes of
the network. For community structures, Leskovec et al. [5] studied the
clustering problem on a wide range of large real-world networks and concluded
that the ideal size for most community-like clusters was around 100 nodes.
Kwak et al. [8] studied Twitter and presented various statistics for the
entire Twittersphere, reporting a non-power-law follower distribution, a short
effective diameter, low reciprocity, and 4 degrees of separation in Twitter's
follower network, all of which differ from other human social networks.

Following the structure-centric approach, link prediction and group formation
problems were studied by various researchers. Notably, Liben-Nowell and
Kleinberg [TO] surveyed various unsupervised methods for the link prediction
problem and conducted extensive experiments on co-authorship networks.
Backstrom et al. [2] proposed a model for network membership, growth, and
evolution by analyzing the DBLP and LiveJournal social networks. They found
that how individuals join communities and how communities grow depend on the
underlying network structure, which supports structural cohesion in our
discussion. Taking a different, user-centric approach, Shi et al. [25] studied
the user behavior of joining communities on online forums. Among other
features, the authors studied the similarity between users and its relation to
community overlap. Their results suggested that user similarity defined by
frequency of communication or number of common friends was inadequate to
predict grouping behavior, but that adding node/user-level features could
improve the fit of the model.

Among other notable efforts on group sustainability, Kairam et al. [7]
analyzed the long-term (two years) dynamics of communities and modeled future
community growth rate as a function of past growth or of the current size and
age of the community. The study predicted the growth rate and sustainability
of communities and found that growth rate is correlated with the current size
and age of a group. For community sustainability, the size of the largest
clique was the best feature.

In contrast to group-level studies, some researchers focused on user-level
studies, making efforts to understand user demographics on social networks. A
noteworthy study by Rao et al. [18] presented an approach for automatically
creating ethnic profiles of users, focusing on names as the key feature.
Building on that study, Pennacchiotti et al. [H] proposed a machine learning
approach to user classification on Twitter by analyzing a user's friends,
posts, and profile information.

In all of the work discussed above, researchers modeled the group
sustainability problem either by structural properties such as group size, or
by the evolution of volume in content and activity. In our study, we present a
systematic theoretical underpinning for group behavior by modeling the
identity and cohesion phenomena as features-of-interest that cover not only
the structure- and activity-centric features studied in the past, but also
users' identity-level characteristics. Furthermore, we propose measures that
enable a fine-grained understanding of group sustainability via content
divergence, overcoming the shortcomings of size- and growth-rate-based
measures discussed in Section 1.

3 Modeling and Experiments 

Our experiment involves three major steps: 1) Identifying social groups, 2)
Computing social identity and cohesion characteristics of users in the groups,
and 3) Tracking the sustainability of the groups. We therefore first describe
our data collection and social group identification approach, followed by
quantitative modeling of each of the phenomena (social identity, social
cohesion, and group sustainability) needed to test the research hypotheses
proposed in Section 1.

3.1 Data Collection 

The Twitter Streaming API provides real-time tweet collection. Alternatively,
the Twitter Search API provides keyword-based search queries, returning the
1500 most recent tweets in one response and excluding tweets from users who
opt for privacy. To study the community forming around topic discussions for a
specific event (denoted as "event-oriented community"), we created a Streaming
API based crawler that collected the ongoing tweet stream relevant to the
event based on a seed keyword set, similar to [20]. For a keyword k, we crawl
all tweets that mention k, K, #k and #K. The seed list of keywords and
hashtags is kept up-to-date by first automatically collecting other hashtags
and keywords that frequently appear in the crawled tweets and then manually
selecting highly unambiguous hashtags and keywords from this list. We avoid
the query drift problem by placing a human in the loop to ensure that
ambiguous keywords are not crawled out of context but only in combination with
a contextually relevant keyword. One could also use a more sophisticated
computational method, such as the Continuous Semantics framework [24], to
model the evolving knowledge and find highly relevant keywords for an event,
but that is not the focus of this paper.
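The keyword matching rule above (a keyword k in either letter case, with or
without a leading '#') can be sketched as a single regular expression. This is
a minimal illustration rather than our crawler: the function names are
hypothetical, and treating k and K as fully case-insensitive matching on whole
words is an assumption.

```python
import re


def build_matcher(keywords):
    """Compile one pattern matching any seed keyword, optionally prefixed by '#'.

    Assumption: matching is case-insensitive and restricted to whole words.
    """
    # Longer keywords first so that multi-word phrases win over their prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"(?<!\w)#?(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def is_relevant(tweet_text, matcher):
    """Return True if the tweet mentions at least one seed keyword."""
    return matcher.search(tweet_text) is not None


matcher = build_matcher({"sandy", "frankenstorm"})
print(is_relevant("Stay safe, #Sandy is coming", matcher))
print(is_relevant("Visiting sandyhook today", matcher))
```

The whole-word restriction is one simple guard against unrelated substrings;
it does not replace the human-in-the-loop check for ambiguous keywords.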

We also store metadata associated with the crawled tweets and their posters,
such as author location, follower and followee counts, the poster's profile
description, etc. We also crawl the social graph (i.e., follower list) of
tweet posters who are part of the event-oriented community. For users who had
activated the privacy setting, no information was crawled, and their tweets
were discarded from the dataset.

To enable temporal analysis and reasoning, tweets are grouped into slices
according to their associated time-stamp. In this paper each time slice is one
day. Table 1 shows various statistics about the datasets, two of which concern
natural disasters (Type "D").



Event Name               Type     Duration             #Tweets   #Users
-----------------------------------------------------------------------
Hurricane Irene          D        08/24-09/19, 2011    183K      77K
Hurricane Sandy          D        10/27-11/07, 2012    4.9M      1.8M
India Anti-Corruption    non-D    11/05-12/02, 2011    100K      21K
Occupy Wall Street       non-D    11/05-12/02, 2011    2.1M      331K
Anti-SOPA                non-D    01/19-02/19, 2012    744K      389K

Table 1: Twitter data statistics centered on a diverse set of events (D =
natural disaster event)





           Lasting                   Transient
-----------------------------------------------------
Loose      Occupy Wall Street,       Hurricane Sandy,
           India Anti-Corruption     Hurricane Irene
Compact    (none)                    Anti-SOPA

Table 2: Event classification [17] based on event characteristics

Analogous to [17], we note that events possess varying characteristics on the
dimensions of activity, social significance, participant types, etc.
Therefore, we also show the event classification for our datasets in Table 2.
The Loose and Compact features reflect the nature of the participants in the
community; for example, the Anti-SOPA event was mostly driven by technology
enthusiasts, a compact user set, and thus it is a Compact event. The Lasting
and Transient features describe how long interest in the event persists; for
example, the Occupy Wall Street protests were discussed at length on social
media, while a week after Hurricane Sandy few people were still discussing it
except those involved in the rebuilding phase of the disaster. Also, hurricane
events can be thought of as unexpected and protest events as deterministic,
due to their organized, coordinated sub-events. In addition, the type of
population involved can be used to suggest the global versus local scope of an
event; for example, Hurricane Irene is local due to local coordination of
actors, whereas Anti-SOPA is global due to global coordination of actors. Such
event characterization helps us diagnose the effects of event characteristics
on how well social identity and cohesion explain group sustainability.

3.2 Identifying Social Groups 

Given all users in an event-oriented community, it is necessary to identify
appropriate social groups on which quantitative analyses will be performed.
The resulting social groups should reflect online interaction among users that
goes beyond simply using the same word in their tweets. Moreover, the grouping
criterion needs to be independent of any feature of social cohesion and social
identity (defined in the following sections) so that the results are not
biased.

To that end, we propose clustering users based on their interactions, which
can be retweets, replies, or mentions. A graph is created to represent those
relationships, where vertices stand for users and edges indicate at least one
interaction between two users during the whole dataset duration. We use a
multi-level graph clustering algorithm [55] to identify social groups, and
remove groups that contain fewer than 10 members. We also remove groups that
were active (i.e., at least one member posted a relevant tweet) in fewer than
five time slices. Clustering parameters are tuned such that the average size
of the resulting groups is around 100, an empirical size of compact
communities as observed in [5]. Table 3 summarizes the information of each
dataset's social groups.





Event                    # Groups    # Users    # Users/Group
-------------------------------------------------------------
Hurricane Irene          228         21,615     94.80
Hurricane Sandy          3,438       340,401    99.01
India Anti-Corruption    107         11,899     111.21
Occupy Wall Street       2,549       239,927    94.13
Anti-SOPA                1,389       149,490    107.62

Table 3: Information of social groups identified from each event-oriented
community
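The pre- and post-processing around the clustering step in this section can be
sketched as follows. This is a minimal illustration under stated assumptions:
the clustering algorithm itself is not reproduced (its output is passed in as
`groups`), and the function names and input tuple layouts are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def interaction_edges(interactions):
    """Collapse (source, target, kind) interactions into undirected user pairs.

    kind is one of 'retweet', 'reply', or 'mention'; one interaction of any
    kind is enough to create an edge.
    """
    return {frozenset((src, dst)) for src, dst, _kind in interactions if src != dst}


def active_days(tweets):
    """Map each user to the set of daily time slices in which they posted."""
    days = defaultdict(set)
    for user, timestamp in tweets:
        days[user].add(timestamp.date())
    return days


def filter_groups(groups, tweets, min_members=10, min_slices=5):
    """Keep groups with at least min_members that were active in min_slices days."""
    days = active_days(tweets)
    kept = []
    for members in groups:
        group_days = set()
        for user in members:
            group_days |= days.get(user, set())
        if len(members) >= min_members and len(group_days) >= min_slices:
            kept.append(members)
    return kept


# Ten users posting across five different days form a group that is kept.
tweets = [(f"u{i}", datetime(2012, 10, 27 + (i % 5))) for i in range(10)]
big_group = {f"u{i}" for i in range(10)}
small_group = {"u0", "u1"}
print(len(filter_groups([big_group, small_group], tweets)))
```

The thresholds of 10 members and five time slices are the ones stated above;
everything else in the sketch is scaffolding for illustration.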

3.3 Quantifying Social Cohesion 

To study the structural cohesion of social groups in a quantitative manner, we
extract information from Twitter users' follower-followee graph. For each
social group, we construct its corresponding node-induced sub-graph from the
follower graph. Unlike in many other online social network services, the
follower relation on Twitter is directional, leading to three options when
inducing the sub-graph:

• Reciprocal:

An undirected edge is created between two users only when both of them follow
each other. This choice directly reflects the assumption of mutual
interpersonal attraction in the social cohesion approach. Statistics include
density, transitivity (i.e., global clustering coefficient), average local
clustering coefficient, and the maximum average length of pairwise shortest
paths over all connected components (short-named "average shortest path
length").

• Undirected:

An undirected edge is created between two users if either of them follows the
other. The underlying assumption is that a one-way interpersonal attraction is
sufficient to sustain the social group. The same group of statistics as for
the reciprocal sub-graph is computed.

• Directed:

We also compute density and transitivity on the directed sub-graph of each
social group, without converting it to an undirected graph.
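The three ways of inducing a group's sub-graph can be illustrated with a short
sketch that also computes density, the simplest of the statistics listed. The
function names and the edge representation (a pair (u, v) meaning "u follows
v") are assumptions made for this example.

```python
def induced_edges(follows, members):
    """Directed follower edges (u follows v) with both endpoints in the group."""
    members = set(members)
    return {(u, v) for u, v in follows if u in members and v in members and u != v}


def reciprocal_edges(directed):
    """Undirected edges between users who follow each other."""
    return {frozenset((u, v)) for u, v in directed if (v, u) in directed}


def undirected_edges(directed):
    """Undirected edges between users where at least one follows the other."""
    return {frozenset((u, v)) for u, v in directed}


def density(num_edges, num_nodes, directed=False):
    """Fraction of possible edges that are present."""
    if num_nodes < 2:
        return 0.0
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2
    return num_edges / possible


follows = {("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")}
group = {"a", "b", "c"}
directed = induced_edges(follows, group)
reciprocal = reciprocal_edges(directed)
undirected = undirected_edges(directed)
print(density(len(directed), len(group), directed=True))
print(density(len(reciprocal), len(group)))
print(density(len(undirected), len(group)))
```

By construction the reciprocal sub-graph is never denser than the undirected
one, since every reciprocal edge is also an undirected edge.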

We are especially interested in comparing the cohesion statistics calculated
according to the reciprocal approach with those of the undirected approach.
While both types of cohesion statistics reflect structural properties of
social groups, the former encodes the condition of mutual attraction. From the
perspective of the social cohesion approach, the following hypothesis should
hold:

Hypothesis 3.1 Cohesion statistics of the reciprocal follower network are a
better indicator of social group sustainability than those of the undirected
follower network.

The range for all cohesion statistics is [0, 1], except for the average
shortest path length, as shown in Table 4. We report observations on the
statistics in Section 4. We also note that the term "structural cohesion" is
used in the existing sociology literature [13] [29], where it is defined as
the minimum number of nodes one needs to remove to disconnect a graph. We do
not include this statistic because almost all (more than 97%) of the social
groups contain at least one fringe node (whose degree is one) or singleton,
meaning that the value of this statistic for most groups is at most one.
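The one cohesion statistic not bounded by 1, the "average shortest path
length", is defined in Section 3.3 as the maximum, over connected components,
of the average pairwise shortest path length. A minimal sketch using
breadth-first search is given below; the adjacency-dictionary input and the
choice to let singleton components contribute nothing are assumptions of this
example.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from source to every node reachable from it."""
    distances, queue = {source: 0}, deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def connected_components(adjacency):
    """Connected components of an undirected graph given as {node: neighbors}."""
    seen, components = set(), []
    for start in adjacency:
        if start not in seen:
            component = set(bfs_distances(adjacency, start))
            seen |= component
            components.append(component)
    return components


def max_average_shortest_path(adjacency):
    """Maximum over components of the mean pairwise shortest path length."""
    best = 0.0
    for component in connected_components(adjacency):
        n = len(component)
        if n < 2:
            continue  # assumption: singletons contribute nothing
        total = sum(sum(bfs_distances(adjacency, s).values()) for s in component)
        best = max(best, total / (n * (n - 1)))
    return best


# Path a-b-c, a separate edge e-f, and a singleton d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set(), "e": {"f"}, "f": {"e"}}
print(max_average_shortest_path(graph))
```

In this toy graph the path component has mean distance 4/3 and the single edge
has mean distance 1, so the statistic is determined by the path component.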

3.4 Quantifying Social Identity 

To quantify the social identity phenomenon, we extract identity features from
user profile information as well as activity, since social behavior tends to
associate the user with established identities via self-representation and
with incentive-based identities via user actions (e.g., 'active celebrity on
Twitter'). For instance, people from New York like to be called 'New Yorker';
similarly, University of Michigan students present themselves with the
identity of the institution as 'UMichigan', and computer engineers like to be
called 'hackers' or 'geeks'. We observe that there are various types of
identity that we live with in our daily lives, ranging from regional,
occupational expertise, and organizational to cultural and religious
identities. We cover some of these types in this paper. From profile
information, we use location and interests metadata to extract the following
types of social identities, and for each such identity we compute the entropy
of its distribution in every social group:

• Regional Identity:

Based on the 'location' field of the user's Twitter profile metadata, we map
users to geographical regions that tend to form an identity in our daily
lives, e.g., 'Indian' for an India-based user, 'Brit' for a UK-based user, and
'New Yorker' for a New York-based user. We choose to create state-level and
nation-level identities of users in our study. Specifically, for an event, we
map users located in the nation where the event occurs to the identity of
their state within that nation, while the remaining users are mapped to their
respective national identities. We use the Geonames dataset on Linked Open
Data (LOD) and the Google Maps API to convert user profile locations into
latitude-longitude pairs as well as state- and country-level information. We
note that this simple model of two regional levels (state and country) for
identity can also be ported to a smaller scale (county and its next
super-class, state) if an event is very specific to local interest.

• Expertise Identity:

Using the 'description' metadata in user profiles, we map users to occupations
and interests by entity spotting; these are also very common identities in our
daily lives, e.g., 'Researcher', 'Artist', or 'NFL player'. We fetch
occupation titles from knowledge base sources, such as Wikipedia and the US
department of Labor Statistics reports. We extend this knowledge base with a
human in the loop, because new conventions of social media have given rise to
new forms of occupational interests (e.g., 'blogger in digital marketing')
that are not present in the formal occupation knowledge bases. Finally, we
classify occupation interests into 10 broader classes and thus give class
labels to users, inspired by the domain classification on news websites and by
the higher levels of occupation classes in the knowledge bases:

ACADEMICS, BUSINESS, POLITICS, TECHNOLOGY, BLOGGING,
JOURNALISM, ART, SPORTS, MEDICAL, OTHERS

We note that there can be more advanced methods to map users to expertise
classes, but that is not our focus, and we leave the exploration of more
sophisticated methods for future work.

The recent emergence of services like Klout and Foursquare has brought a new
convention of identity into our social lives, where we associate ourselves
with incentive-based identities, e.g., 'Celebrity' on Klout, or 'Mayor of
Pier-39' on Foursquare for the popular San Francisco spot Pier-39. Therefore,
in order to evaluate the effect of such identities derived from user actions
in social networks, we propose the following identity type based on the
expertise presentation work of Purohit et al. [16] and the influence and
passivity work of Romero et al. [T§]:

• Activity-Influence-Diffusion (AID) Identity:

Based on user actions on the platform (Twitter here), we use three metrics
that contribute to building a user's AID identity: activity, popularity, and
diffusion strength. We model the activity metric by the number of posts by the
user, the popularity metric by the number of mentions of the user, and
diffusion strength by the number of retweets of the user's posts. We compute
scores on each of the three metric dimensions and then use the 50th percentile
as a threshold to create two levels on each dimension, giving rise to 8
classes as shown in Figure 1.
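The construction of the eight AID classes can be sketched as below. This is an
illustrative reading of the description above: raw counts are used directly as
scores, the median serves as the 50th percentile, and a value is "high" only
when strictly above the median. All three choices, as well as the function
name, are assumptions.

```python
from statistics import median


def aid_classes(users):
    """Assign each user a (activity, popularity, diffusion) level triple.

    users maps a user id to (number of posts, number of mentions received,
    number of retweets of the user's posts).
    """
    thresholds = [median(values[i] for values in users.values()) for i in range(3)]
    return {
        user: tuple("high" if values[i] > thresholds[i] else "low" for i in range(3))
        for user, values in users.items()
    }


users = {
    "a": (10, 0, 0),
    "b": (1, 5, 0),
    "c": (5, 1, 3),
    "d": (0, 9, 9),
}
for user, levels in aid_classes(users).items():
    print(user, levels)
```

Two levels on each of three dimensions yield 2 x 2 x 2 = 8 possible classes;
the sketch deliberately does not attach class names to level combinations.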

In contrast with regional and expertise identities, which are meaningful in
the physical world, AID identity is a virtual-world identity defined
exclusively in the cyber realm. To our knowledge, few attempts have been made
to study the impact of both online and offline identities on social networks.



[Figure: recoverable labels are the POPULARITY axis and the classes
Disengaged Celebrity, Active Celebrity, Conversationalist, and Information
Hub.]

Figure 1: AID Identity for users based on three action metrics

The range of identity statistics is from 0 to ln(C), where C is the number of
unique classes in an identity type. In Table 4 we summarize the basic
information of each cohesion and identity statistic, and we report
observations in Section 4. The upper bounds of identity entropy values are
included in brackets.
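The identity statistic is the entropy of the class distribution within a
group, using the natural logarithm so that its maximum is ln(C). A minimal
sketch follows; the function name and the list-of-labels input are assumptions
of this example.

```python
import math
from collections import Counter


def identity_entropy(labels):
    """Shannon entropy (natural log) of the identity labels within one group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# A group whose members all share one regional identity has zero entropy.
print(identity_entropy(["New York"] * 6))
# A group spread evenly over all 8 AID classes reaches the upper bound ln(8).
print(identity_entropy(list(range(8))), math.log(8))
```

With C = 10 expertise classes and C = 8 AID classes, the upper bounds are
ln(10) ≈ 2.30 and ln(8) ≈ 2.08, matching the bracketed values in Table 4.
Lower entropy indicates that group members share more of the same identity.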

3.5 Measuring Social Group Sustainability 

As discussed in Section 1, there are limitations to using size and growth rate
to measure the sustainability of a social group. In particular, growth rate
captures neither the group's discussion divergence nor its membership
stability. Here, we introduce two alternative measures of social group
sustainability, the first of which incorporates the notion of group discussion
divergence and the second of which reflects membership stability.

Statistic                  Hurricane Irene        Hurricane Sandy        India Anti-Corruption  Occupy Wall Street     Anti-SOPA
--------------------------------------------------------------------------------------------------------------------------------------------
Structural Cohesion Statistics - Directed
Density                    0.03 ± 0.05            0.04 ± 0.07            0.01 ± 0.01            0.02 ± 0.02            0.03 ± 0.06
Transitivity               0.22 ± 0.18            0.23 ± 0.21            0.20 ± 0.22            0.23 ± 0.19            0.20 ± 0.22

Structural Cohesion Statistics - Reciprocal
Density                    0.02 ± 0.04            0.03 ± 0.06            0.00 ± 0.01            0.01 ± 0.02            0.02 ± 0.05
Transitivity               0.17 ± 0.19            0.22 ± 0.24            0.14 ± 0.23            0.19 ± 0.22            0.18 ± 0.25
Avg. Clustering Coef.      0.06 ± 0.08            0.08 ± 0.11            0.02 ± 0.04            0.05 ± 0.06            0.05 ± 0.09
Avg. Shortest Path Length  2.24 ± 1.30            2.16 ± 1.34            1.71 ± 0.99            1.98 ± 0.72            1.75 ± 0.97

Structural Cohesion Statistics - Undirected
Density                    0.04 ± 0.06            0.05 ± 0.08            0.02 ± 0.02            0.02 ± 0.03            0.04 ± 0.07
Transitivity               0.16 ± 0.17            0.20 ± 0.20            0.13 ± 0.17            0.18 ± 0.17            0.19 ± 0.20
Avg. Clustering Coef.      0.13 ± 0.12            0.13 ± 0.14            0.07 ± 0.09            0.09 ± 0.09            0.11 ± 0.13
Avg. Shortest Path Length  2.74 ± 0.94            2.74 ± 1.38            2.53 ± 0.81            2.65 ± 0.83            2.38 ± 1.03

Identity Statistics
Regional Entropy           2.60 ± 0.69 (5.28)     2.50 ± 0.74 (5.74)     2.44 ± 0.26 (4.94)     2.75 ± 0.51 (5.65)     2.23 ± 0.81 (5.53)
Expertise Entropy          1.80 ± 0.24 (2.30)     1.14 ± 0.43 (2.30)     1.75 ± 0.17 (2.30)     1.67 ± 0.19 (2.30)     1.51 ± 0.36 (2.30)
AID Entropy                0.92 ± 0.23 (2.08)     0.99 ± 0.21 (2.08)     1.16 ± 0.26 (2.08)     1.18 ± 0.24 (2.08)     1.07 ± 0.22 (2.08)

Table 4: Mean and standard deviation of structural cohesion/identity
statistics. Identity entropy upper bounds are listed in brackets.

3.5.1 Topic Divergence 

To quantify the novel notion of discussion divergence within a group, we first
construct a dynamic topic model [4] and infer the topics of discussion. The
input to the topic model is a collection of vocabulary vectors, each of which
represents the event-related tweets posted by an author and is indexed by
discrete time-stamps. The vocabulary includes words and phrases pertaining to
the event (described in Section 3.1), as well as hashtags with the leading '#'
symbol stripped. The dynamic topic model has the advantage of modeling
systematic topic shift (presumably due to the event's progress) automatically,
which allows us to investigate the true difference between an individual
member's topic distribution and the corresponding group's topic distribution
at any given time.

We let the number of topics K be 3, and use default settings for the other
parameters of model inference.^1 In Table 5 we list each topic's top vocabulary
(excluding the event name itself) at three different stages of the event
(beginning, middle and end).^2 The transition of topic content is continual and
smooth, and each topic is semantically distinct.

1 We used the implementation publicly available at
https://code.google.com/p/princeton-statistical-learning/downloads/detail?name=dtm_release.tgz

2 Due to space constraints, we only show the lists of top words for Hurricane
Sandy and Occupy Wall Street, the two largest datasets.

Hurricane Sandy

| Topic | Beginning | Middle | End |
|---|---|---|---|
| Topic 1 | tropical storm, east coast, canada, path | red cross, jersey shore, caused, staten island | red cross, staten island, mexico, caused |
| Topic 2 | new york, state, google, android | new york, new jersey, hurricane katrina, media | new york, new jersey, states, hurricane katrina |
| Topic 3 | frankenstorm, halloween, east coast, atlantic | frankenstorm, fema, halloween, mitt romney | frankenstorm, knicks, fema, nyc |

Occupy Wall Street

| Topic | Beginning | Middle | End |
|---|---|---|---|
| Topic 1 | occupy, protest, movement, occupytogether | occupy, n17, nypd, brooklyn bridge | occupy, oo, occupyla, movement |
| Topic 2 | movement, us, bahrain, occupy movement | nypd, movement, protest, time | nypd, movement, anonymous, protest |
| Topic 3 | occupy, oo, p2, tcot | occupy, p2, tcot, oo | p2, tcot, republican, teaparty |

Table 5: Top vocabulary of each topic at different event stages

The inference process of the topic model returns a user's topic distribution
at each time slice, denoted as $\beta^u_t$ for user $u$ at time $t$. Then we
calculate the group topic distribution for group $g$ at time $t$ ($g_t$) as

$$\beta^{g_t}_i = \frac{1}{|g_t|} \sum_{u \in g_t} \beta^u_{t,i}, \quad \forall i = 1, 2, 3, \qquad (1)$$

and the topic divergence of $g_t$ is defined as

$$TD(g_t) = \frac{1}{|g_t|} \sum_{u \in g_t} KL\left(\beta^u_t \,\|\, \beta^{g_t}\right), \qquad (2)$$

where KL is the Kullback-Leibler divergence. Intuitively, this definition gauges
the average divergence of each group member's topic distribution from the
group's overall topic distribution. The greater the TD value, the stronger the
indication that a group lacks conformity in discussion.

| Event | Divergence | Stability | Growth |
|---|---|---|---|
| Hurricane Irene | 1.04 ± 0.43 | 0.53 ± 0.10 | 1.64 ± 0.88 |
| Hurricane Sandy | 0.71 ± 0.47 | 0.60 ± 0.19 | 1.66 ± 0.99 |
| India Anti-Corruption | 1.21 ± 0.40 | 0.67 ± 0.12 | 1.59 ± 0.59 |
| Occupy Wall Street | 1.34 ± 0.38 | 0.69 ± 0.14 | 2.17 ± 1.70 |
| Anti-SOPA | 0.68 ± 0.42 | 0.51 ± 0.21 | 1.91 ± 1.09 |

Table 6: Mean and standard deviation of sustainability measures
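The topic divergence of a group at one time slice can be illustrated with a short Python sketch. It assumes that the group topic distribution is the component-wise mean of the members' distributions, that the divergence is taken from each member to the group, i.e. KL(member || group), and that natural logarithms are used. The function names are hypothetical and the code is not the implementation used for the reported results.

```python
import math


def group_topic_distribution(member_dists):
    """Component-wise mean of the members' topic distributions (assumption)."""
    n = len(member_dists)
    k = len(member_dists[0])
    return [sum(d[i] for d in member_dists) / n for i in range(k)]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topic_divergence(member_dists):
    """Average KL divergence of each member from the group distribution."""
    group = group_topic_distribution(member_dists)
    return sum(kl_divergence(d, group) for d in member_dists) / len(member_dists)


# Two members who each discuss a single, different topic.
split_group = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Three members with identical topic mixtures.
uniform_group = [[0.2, 0.3, 0.5]] * 3

td_split = topic_divergence(split_group)      # equals log(2)
td_uniform = topic_divergence(uniform_group)  # approximately 0
```

Because the group distribution is a mean over members, every topic that a member uses has non-zero group probability, so the KL terms are always finite in this sketch.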

3.5.2 Membership Stability 

The second sustainability measure we propose, called membership stability, ex- 
plicitly discounts a social group's size by its "total change" from the previous 
snapshot. For $g_t$, its membership stability is defined as

where $\Delta$ is the set symmetric difference operator. The symmetric difference
of the group member sets at two sequential time slices is the set of users that
left the group together with the users that newly joined the group. This
definition is inspired by
a similar idea in pQ, where the authors introduced the notion of stability index 
to perform behavioral analysis of individuals in evolutionary graphs. 
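The "total change" between two consecutive snapshots can be illustrated with Python's built-in set operations. This sketch only computes the symmetric difference of the member sets; it does not reproduce how the membership stability measure combines that quantity with the group size. The function name and the example members are hypothetical.

```python
def membership_change(prev_members, curr_members):
    """Return (left, joined, total_change) between two member-set snapshots."""
    left = prev_members - curr_members      # users present before, gone now
    joined = curr_members - prev_members    # users absent before, present now
    total_change = prev_members ^ curr_members  # symmetric difference
    return left, joined, total_change


prev_members = {"a", "b", "c"}
curr_members = {"b", "c", "d", "e"}
left, joined, total_change = membership_change(prev_members, curr_members)
```

By construction, the symmetric difference is the union of the users who left and the users who joined.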

3.5.3 Growth Rate 

For comparison purposes, we also calculate the growth rate, a widely-used size- 
based sustainability measure, for each g t : 

$$GR(g_t) = \frac{|g_t|}{|g_{t-1}|} \qquad (4)$$
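As a small illustration, the growth rate of a group over consecutive snapshots can be computed as the ratio of successive group sizes. The sketch below assumes each snapshot is a non-empty set of member identifiers; the function name `growth_rates` is hypothetical.

```python
def growth_rates(snapshots):
    """Return [|g_t| / |g_{t-1}|] for every pair of consecutive snapshots."""
    return [len(curr) / len(prev) for prev, curr in zip(snapshots, snapshots[1:])]


snapshots = [{"a", "b"}, {"a", "b", "c", "d"}, {"b", "c", "d"}]
rates = growth_rates(snapshots)          # [2.0, 0.75]
mean_rate = sum(rates) / len(rates)      # 1.375, the group's average over time
```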

Table 6 provides an overview of the range of each sustainability measure for
each event, where the mean and standard deviation are calculated from each
social group's average sustainability measure over time. The values of topic
divergence and growth rate spread more broadly, while the values of membership
stability are more concentrated.

Correlation between Cohesion/Identity Statistics and Sustainability
Measures: We calculate the correlation coefficients between each social
cohesion/identity statistic and each sustainability measure (topic divergence,
membership stability, growth rate). We filter out social groups that contain
fewer than ten members or have been active in fewer than five time slices. Each
social group emits a tuple in the form of (cohesion/identity statistics, mean of
sustainability measure over time). Tables 7, 8 and 9 summarize those values.
Cells whose absolute value is greater than 0.25 are boldfaced. We analyze those
results in detail in the next section.

| Statistic | Hurricane Irene | Hurricane Sandy | India Anti-Corruption | Occupy Wall Street | Anti-SOPA |
|---|---|---|---|---|---|
| Structural Cohesion Statistics - Directed | | | | | |
| Density | **-0.33** | **-0.33** | -0.14 | -0.11 | **-0.33** |
| Transitivity | 0.10 | 0.05 | 0.06 | 0.16 | 0.07 |
| Structural Cohesion Statistics - Reciprocal | | | | | |
| Density | **-0.26** | **-0.30** | -0.11 | -0.07 | **-0.27** |
| Transitivity | 0.15 | 0.06 | 0.24 | 0.19 | 0.13 |
| Avg. Clustering Coef. | 0.17 | -0.11 | **0.32** | 0.16 | -0.01 |
| Avg. Shortest Path Length | **0.57** | 0.20 | 0.24 | **0.43** | **0.46** |
| Structural Cohesion Statistics - Undirected | | | | | |
| Density | **-0.35** | **-0.34** | -0.14 | -0.13 | **-0.36** |
| Transitivity | 0.11 | 0.04 | 0.02 | 0.23 | 0.11 |
| Avg. Clustering Coef. | 0.22 | -0.09 | 0.05 | 0.20 | 0.00 |
| Avg. Shortest Path Length | **0.56** | **0.28** | **0.28** | **0.37** | **0.51** |
| Identity Statistics | | | | | |
| Regional Entropy | **0.43** | **0.40** | **0.28** | 0.25 | **0.58** |
| Expertise Entropy | **0.44** | **0.64** | **0.29** | 0.18 | **0.39** |
| AID Entropy | **0.47** | **0.28** | 0.24 | **0.58** | **0.36** |

Table 7: Correlation coefficients between structural cohesion/identity statistics
and topic divergence
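As a rough illustration of how one coefficient in the correlation tables could be obtained, the following Python sketch filters groups by size and activity and then correlates a per-group statistic with the time-averaged sustainability measure. It assumes Pearson's correlation coefficient (the text does not name the coefficient used), and the record fields `n_members`, `density` and `topic_divergence`, as well as all toy values, are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlate(groups, statistic, measure, min_members=10, min_slices=5):
    """Correlate a group statistic with the mean of a per-slice measure."""
    # Keep groups with at least ten members and at least five active slices.
    kept = [g for g in groups
            if g["n_members"] >= min_members and len(g[measure]) >= min_slices]
    xs = [g[statistic] for g in kept]
    ys = [sum(g[measure]) / len(g[measure]) for g in kept]
    return pearson(xs, ys)


# Toy records: the last two groups are removed by the filter.
groups = [
    {"n_members": 12, "density": 0.1, "topic_divergence": [1.0] * 5},
    {"n_members": 15, "density": 0.2, "topic_divergence": [0.8] * 5},
    {"n_members": 20, "density": 0.3, "topic_divergence": [0.6] * 6},
    {"n_members": 5, "density": 0.9, "topic_divergence": [2.0] * 5},
    {"n_members": 30, "density": 0.9, "topic_divergence": [2.0] * 3},
]
r = correlate(groups, "density", "topic_divergence")
```

In this toy data, denser groups have proportionally lower divergence, so the coefficient is -1 up to floating-point error.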



| Statistic | Hurricane Irene | Hurricane Sandy | India Anti-Corruption | Occupy Wall Street | Anti-SOPA |
|---|---|---|---|---|---|
| Structural Cohesion Statistics - Directed | | | | | |
| Density | 0.16 | 0.03 | 0.01 | 0.10 | 0.08 |
| Transitivity | 0.13 | 0.06 | 0.01 | 0.13 | 0.07 |
| Structural Cohesion Statistics - Reciprocal | | | | | |
| Density | 0.21 | 0.01 | -0.03 | 0.12 | 0.02 |
| Transitivity | 0.18 | 0.03 | 0.06 | 0.16 | 0.05 |
| Avg. Clustering Coef. | **0.28** | 0.02 | 0.10 | 0.20 | 0.04 |
| Avg. Shortest Path Length | 0.21 | 0.04 | 0.11 | **0.28** | 0.06 |
| Structural Cohesion Statistics - Undirected | | | | | |
| Density | 0.12 | 0.05 | 0.02 | 0.09 | 0.12 |
| Transitivity | 0.17 | 0.08 | 0.03 | 0.22 | 0.08 |
| Avg. Clustering Coef. | 0.25 | 0.09 | 0.03 | 0.22 | 0.16 |
| Avg. Shortest Path Length | 0.16 | 0.05 | 0.15 | 0.19 | 0.06 |
| Identity Statistics | | | | | |
| Regional Entropy | -0.01 | 0.02 | -0.03 | -0.12 | -0.03 |
| Expertise Entropy | 0.13 | 0.15 | -0.04 | 0.02 | -0.04 |
| AID Entropy | **0.42** | 0.07 | **0.47** | **0.62** | 0.04 |

Table 8: Correlation coefficients between structural cohesion/identity statistics
and membership stability

| Statistic | Hurricane Irene | Hurricane Sandy | India Anti-Corruption | Occupy Wall Street | Anti-SOPA |
|---|---|---|---|---|---|
| Structural Cohesion Statistics - Directed | | | | | |
| Density | -0.08 | -0.01 | 0.03 | -0.14 | -0.06 |
| Transitivity | -0.02 | -0.19 | 0.07 | -0.10 | -0.07 |
| Structural Cohesion Statistics - Reciprocal | | | | | |
| Density | -0.09 | -0.02 | -0.01 | -0.13 | -0.05 |
| Transitivity | 0.04 | -0.18 | -0.04 | -0.09 | -0.06 |
| Avg. Clustering Coef. | 0.03 | -0.17 | -0.13 | -0.14 | -0.08 |
| Avg. Shortest Path Length | 0.18 | **-0.27** | -0.14 | -0.18 | -0.10 |
| Structural Cohesion Statistics - Undirected | | | | | |
| Density | -0.08 | 0.00 | 0.03 | -0.13 | -0.06 |
| Transitivity | 0.01 | -0.20 | -0.06 | -0.12 | -0.08 |
| Avg. Clustering Coef. | 0.00 | -0.18 | 0.00 | -0.17 | -0.11 |
| Avg. Shortest Path Length | 0.11 | **-0.26** | -0.09 | -0.06 | -0.13 |
| Identity Statistics | | | | | |
| Regional Entropy | 0.20 | -0.21 | -0.07 | 0.15 | -0.02 |
| Expertise Entropy | 0.05 | -0.16 | 0.15 | -0.02 | -0.11 |
| AID Entropy | 0.02 | **-0.43** | **-0.50** | **-0.43** | **-0.28** |

Table 9: Correlation coefficients between structural cohesion/identity statistics
and growth rate



4 Discussion 

In this section, we discuss the results from Section 3 and their implications.

4.1 Identity and Cohesion Statistics

We identify several interesting trends in the results reported in Table 2.
First, in general the entropy numbers^3 are higher for the Occupy Wall Street
and India Anti-Corruption events, the two on-the-ground political rally events,
possibly because the offline interactions heavily involved in those events are
not captured by online social identity statistics. This distinction is most
pronounced when comparing the AID identity entropies of those two events with
those of the other three events. The social groups in these two events tend to
revolve around opinion leaders who often help direct and orchestrate the
movement (such individuals are likely to have high AID values). Therefore social
groups formed in those events generally have a more diverse AID identity
composition, reflecting the presence of opinion leaders as well as followers in
groups. Next on the list, after these two events, is the Anti-SOPA rally, where
Internet celebrities also play a leading role in influencing the discussion.
Another finding from Table 4 is that groups diverge greatly in terms of the
regions their members come from. This may simply be a reflection of the times
and the fact that online social networks are bringing people closer together and
that four out of five events have had significant media attention (SOPA, the odd
one out in terms of media attention, has the lowest regional entropy). Finally,
we note that most events have low density values and their distributions of
transitivity and clustering coefficient are often skewed toward zero. Both
suggest sparse follower/followee connections in most social groups.

3 Note that it is important to normalize these numbers against the maximum
entropy possible for each case.
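The normalization of entropy against its maximum possible value can be sketched in Python as follows. The snippet assumes that identity entropy is the Shannon entropy of the members' category labels and that the maximum is the logarithm of the number of possible categories; the function name and the example labels are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(labels, num_categories):
    """Shannon entropy of `labels` divided by log(num_categories)."""
    if num_categories <= 1 or not labels:
        return 0.0
    n = len(labels)
    counts = Counter(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_categories)


# Four members spread over three of four possible regions.
mixed = normalized_entropy(["NY", "NY", "NJ", "CA"], num_categories=4)
# All members from the same region: no diversity.
single = normalized_entropy(["NY", "NY", "NY"], num_categories=4)
# One member per region: maximal diversity.
even = normalized_entropy(["NY", "NJ", "CA", "TX"], num_categories=4)
```

After normalization the values lie between 0 and 1, which makes entropies of identities with different numbers of categories comparable.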

4.2 Validating Hypotheses 

To validate the two hypotheses introduced in Section 1, we check whether the
signs of the correlation coefficients in Table 7 agree with the implications of
the hypotheses, as follows:

• Hypothesis 1.1 posits that a more cohesive social group has a lower topic
divergence. Higher density, transitivity and clustering coefficient signify
a more cohesive structure, as does a lower value of average shortest
path length. Therefore, we find that 1) group density's negative correlation
with topic divergence and 2) the positive correlation between average
shortest path length and topic divergence are consistent with our
hypothesis, suggesting group density and average shortest path length as
characteristics of sustainable groups. On the other hand, the predominantly
positive correlations of transitivity and clustering coefficient with topic
divergence contradict our hypothesis, as one would expect a social group
with higher transitivity to have lower topic divergence. We suspect this
counter-evidence has to do with the lack of triangles in social groups, as
analyzed below.

• The second hypothesis states that if members of a social group are similar
in identities, then the group should have low topic divergence. As identity
entropy rises when group members' identities become more evenly
distributed, the implication of this hypothesis is that identity entropy
has a positive correlation with topic divergence. Our results agree with this
implication, as all three identities (regional, expertise, and AID) have
positive correlation with topic divergence, for all events.
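The sign check described in the two bullets above can be reproduced with a few lines of Python. The coefficients are copied from the reciprocal and identity rows of Table 7 (event order: Hurricane Irene, Hurricane Sandy, India Anti-Corruption, Occupy Wall Street, Anti-SOPA); the dictionary layout and the helper name `agreement` are assumptions made for illustration.

```python
# Correlations with topic divergence, taken from Table 7.
TABLE7 = {
    "Density (reciprocal)": [-0.26, -0.30, -0.11, -0.07, -0.27],
    "Transitivity (reciprocal)": [0.15, 0.06, 0.24, 0.19, 0.13],
    "Avg. Clustering Coef. (reciprocal)": [0.17, -0.11, 0.32, 0.16, -0.01],
    "Avg. Shortest Path Length (reciprocal)": [0.57, 0.20, 0.24, 0.43, 0.46],
    "Regional Entropy": [0.43, 0.40, 0.28, 0.25, 0.58],
    "Expertise Entropy": [0.44, 0.64, 0.29, 0.18, 0.39],
    "AID Entropy": [0.47, 0.28, 0.24, 0.58, 0.36],
}

# Sign each hypothesis predicts: more cohesion or more shared identity
# should go with lower topic divergence.
EXPECTED_SIGN = {
    "Density (reciprocal)": -1,
    "Transitivity (reciprocal)": -1,
    "Avg. Clustering Coef. (reciprocal)": -1,
    "Avg. Shortest Path Length (reciprocal)": +1,
    "Regional Entropy": +1,
    "Expertise Entropy": +1,
    "AID Entropy": +1,
}


def agreement(values, sign):
    """Number of events whose coefficient has the predicted sign."""
    return sum(1 for v in values if v * sign > 0)


summary = {name: agreement(vals, EXPECTED_SIGN[name]) for name, vals in TABLE7.items()}
```

In `summary`, density, average shortest path length and the three entropies agree with the predicted sign in all five events, while transitivity agrees in none and the clustering coefficient in two.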



4.3 Correlation Strength with Topic Divergence 

Identity Statistics: We note in Table 7 that social identity statistics
(especially regional identity entropy and AID identity entropy) have moderate to
high positive correlation with topic divergence, implying that shared identity
characteristics have a positive effect on the sustainability of the groups, and
this holds true for all events. For social groups with stronger regional
concentration, in-group discussions tend to be more location-specific and
consistent, leading to a smaller degree of member-wise topic divergence,
compared with groups whose members' locations are more dispersed. Similarly, the
presence of users with a similar expertise or interest domain in a social group
tends to keep the scope of discussions more focused. For AID identity, we note
that it reflects user actions; we therefore suspect that, in order to maintain
their incentive-based action identity by changing their actions as little as
possible, users tend to maintain a pattern of focused topic discussions in the
groups.

Cohesion Statistics: For structural cohesion statistics, we find that the
patterns of correlation with topic divergence fall into three categories:

• First, triangle-based characteristics (global and average local clustering
coefficient) show weak correlation with topic divergence in general.
Many social groups have low clustering coefficients (see Table 4) due to
the lack of triangles in their follower networks, hence the weak correlation.
For future work, we plan to alleviate this issue by performing graph
symmetrization, which discovers hidden similarity between nodes by
comparing their inlink and outlink structures [23].

• Second, density statistics have moderate correlation with topic divergence
for Hurricane Irene, Hurricane Sandy, and the Anti-SOPA rally,
indicating that a better-connected social group tends to have a more
cohesive discussion.

Why is this not the case for the datasets of the Occupy Wall Street and India
Anti-Corruption movements? As mentioned in Section 4.1, both of them
are long-lasting events accompanied by an arguably more engaged offline
component, whose information is not captured in cohesion statistics.
Therefore, the density of online social groups is low (see Table 4), making
it less indicative of sustainability for those two events.

• Finally, average shortest path length is consistent in its positive
correlation with topic divergence. Similar to other cohesion statistics, the
average shortest path length reflects the "tightness" of a social group.
Compared with the others, average shortest path length shows clearer
dispersion in value, making the results of correlation analysis more meaningful.

4.4 Reciprocal vs. Undirected Cohesion 

As introduced in Section 1, the necessary and sufficient condition for social
group formation under the cohesion approach is mutual attraction among group
members. In our quantitative analysis, this translates to structural cohesion of
the reciprocal follower graph, where two group members are connected only if
they follow each other. We also derive a corresponding set of undirected
structural cohesion statistics, where two users are connected as long as either
one follows the other. Therefore, undirected cohesion statistics reflect a
weaker assumption that uni-directional interpersonal attraction is sufficient
for social group sustainability.
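The difference between the reciprocal and the undirected follower graph can be made concrete with a small Python sketch. It assumes that follower relations are given as a set of (follower, followee) pairs and uses the standard density of a simple undirected graph, 2m / (n(n-1)); the function names and the example relations are hypothetical.

```python
def reciprocal_edges(follows):
    """Undirected edges {u, v} where u follows v and v follows u."""
    return {frozenset((u, v)) for u, v in follows if u != v and (v, u) in follows}


def undirected_edges(follows):
    """Undirected edges {u, v} where at least one of u, v follows the other."""
    return {frozenset((u, v)) for u, v in follows if u != v}


def density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


members = {"a", "b", "c", "d"}
follows = {("a", "b"), ("b", "a"), ("a", "c"), ("d", "c")}

recip = reciprocal_edges(follows)    # only the a-b pair is mutual
undir = undirected_edges(follows)    # a-b, a-c and c-d
recip_density = density(len(members), recip)
undir_density = density(len(members), undir)
```

Every reciprocal edge is also an undirected edge, so the reciprocal graph can never be denser than the undirected one.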

Is mutual attraction really necessary for structural cohesion, and thus for
sustainability of social groups? That is, can we validate Hypothesis 3.1? Again,
we turn to Table 7 for the answer, and perform a one-sided binomial test on the
relative strength of correlation between both sets of cohesion statistics and
topic divergence. Our null hypothesis is as follows:

$H_0$: It is equally likely that the (correlation) coefficient between a
reciprocal statistic and topic divergence has a higher or lower absolute value
than that of the coefficient between the respective undirected statistic and
topic divergence.

Our alternative hypothesis, corresponding to Hypothesis 3.1, is:

$H_a$: The probability that the (correlation) coefficient between a reciprocal
statistic and topic divergence has a higher absolute value than that of the
coefficient between the respective undirected statistic and topic divergence is
more than 0.5.

The test hypotheses are analogous to the situation where one wants to determine
whether a coin is fair ($H_0$) or its head is heavier than its tail ($H_a$). Out
of 20 observations (4 statistics × 5 events), the reciprocal statistic's
coefficient has the higher absolute value only 9 times, corresponding to a
p-value of 0.7483. With such a large p-value, we cannot reject $H_0$ in favor of
$H_a$; thus there is little evidence supporting Hypothesis 3.1. Therefore our
results suggest that mutual attraction is not a necessary condition of
structural cohesion and group sustainability. Note, however, that this should
not be interpreted as the opposite belief that undirected cohesion statistics
are a better indicator of topic divergence than reciprocal cohesion statistics.
The p-value in that case is 0.4119, which is not significant either.
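The count of 9 out of 20 and both p-values can be checked with a short Python sketch that uses only the standard library. The coefficients are copied from the reciprocal and undirected rows of Table 7; the variable and function names are hypothetical, and the p-value is computed as the exact upper tail of a Binomial(20, 0.5) distribution.

```python
from math import comb

# Correlations with topic divergence from Table 7, per statistic and event.
RECIPROCAL = {
    "Density": [-0.26, -0.30, -0.11, -0.07, -0.27],
    "Transitivity": [0.15, 0.06, 0.24, 0.19, 0.13],
    "Avg. Clustering Coef.": [0.17, -0.11, 0.32, 0.16, -0.01],
    "Avg. Shortest Path Length": [0.57, 0.20, 0.24, 0.43, 0.46],
}
UNDIRECTED = {
    "Density": [-0.35, -0.34, -0.14, -0.13, -0.36],
    "Transitivity": [0.11, 0.04, 0.02, 0.23, 0.11],
    "Avg. Clustering Coef.": [0.22, -0.09, 0.05, 0.20, 0.00],
    "Avg. Shortest Path Length": [0.56, 0.28, 0.28, 0.37, 0.51],
}


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Count the observations where the reciprocal coefficient is stronger.
pairs = [(r, u) for name in RECIPROCAL
         for r, u in zip(RECIPROCAL[name], UNDIRECTED[name])]
n_obs = len(pairs)
recip_wins = sum(1 for r, u in pairs if abs(r) > abs(u))

p_reciprocal = binomial_upper_tail(recip_wins, n_obs)          # reciprocal stronger
p_undirected = binomial_upper_tail(n_obs - recip_wins, n_obs)  # undirected stronger
```

Rounded to four decimals, `p_reciprocal` is 0.7483 and `p_undirected` is 0.4119, matching the values reported in the text.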

4.5 Correlation with Other Measures of Sustainability 

Moving to Tables 8 and 9, we observe that none of the cohesion or identity
statistics, except AID entropy, has a high correlation with either membership
stability or growth rate across all datasets. This supports our argument that
size-based measures of community sustainability may not be sufficient and need
to be complemented by content coherence-based measures for an enhanced
understanding of the sustainability of social groups.

4.6 Effects of Event Characteristics 

Tables 7, 8 and 9 highlight interesting differences in the effectiveness of
cohesion and identity approaches in modeling sustainability across various event
types:

• Table 7 shows that for transient types of events (Hurricane Irene, Hurricane
Sandy and Anti-SOPA), the topic divergence measure (our sustainability
metric) correlates better with features of social identity than with those
of social cohesion. This is perhaps because groups in such volatile
events form in an ad-hoc setting, where groups are less likely to have
existing cohesively connected users, undermining the effects of features
corresponding to social cohesion. Therefore, discussions can be highly
dependent on the characteristics of the group's participants, their
personal behavior and identities.

• It is interesting to note a higher correlation pattern for Anti-SOPA
compared to the Occupy Wall Street and India Anti-Corruption protest events,
for both social identity and cohesion measures in Table 7. This may be due to
the nature of coordination: Anti-SOPA is a cyber protest, requiring better
organization of activities online and thus a more focused representation of
activities (especially the one in which major websites including Wikipedia
took their content offline, replacing it with a black screen in protest),
while Occupy Wall Street and India Anti-Corruption are more ground-run
protests whose events were coordinated through physical meet-ups.

5 Future Work 

We plan to extend our measures of social identity and cohesion with more
features, such as ethnic and religious social relationships, which can enhance
our analysis with more insights into how real-world groups unfold over time. We
also plan to perform the proposed analyses on other social networks, such as
Facebook, LinkedIn and online forums, and on the co-authorship network of DBLP,
to see whether they show similar social phenomena of group dynamics. We also plan
to explore the usage of Twitter Lists subscriptions to create new forms of social
cohesion and identity measures.

6 Conclusion 

This study focuses on characterizing online social group sustainability through
socio-psychological theories of group bonding and attachment: social identity
and social cohesion. This study on Twitter is not only the first to quantify the
theoretic notions of identity and cohesion in social groups, but also presents
various approaches to modeling the sustainability of a group beyond past
approaches based on structural properties such as group size. Features inspired
by both theories are found to correlate well with social group sustainability.
We also observe an effect of event characteristics on the stability of the groups
and report our observations from large-scale experimentation on a diverse set of
real-world events.